9.4.2. Optimization Methods

Like other machine learning algorithms, deep learning aims to find a model that best describes the knowledge behind the data. However, deep learning models, i.e., deep neural networks, are much more complicated than models built with other machine learning algorithms. As a result, the search for the optimal model requires an iterative optimization process to minimize the loss function. A method or algorithm that dictates this search process is called a solver or an optimizer. Solvers are an essential part of deep learning: we usually need to specify which solver to use, and its hyperparameters need to be determined to configure it. Common solvers include SGD, Momentum, NAG (Nesterov), AdaGrad, AdaDelta, RMSprop, Adam, AdaMax, and Nadam.

SGD

In a broad sense, SGD has three variants: batch gradient descent, stochastic gradient descent, and mini-batch gradient descent. In many contexts, "SGD" is taken to mean mini-batch gradient descent. The essence of all three variants is the following gradient descent update.
\begin{equation*}
w_{t+1}=w_{t}+\Delta w_{t} \tag{9.43}
\end{equation*}
The increment $\Delta w_{t}$ is computed from the gradient as
\begin{equation*}
\Delta w_{t}=-\eta \cdot g_{t} \tag{9.44}
\end{equation*}
where $g_{t}$ is the gradient, $w$ is the model parameter to be updated, the subscripts $t$ and $t+1$ denote the current and following steps, respectively, $\eta$ is the learning rate, and $f$ is the objective/loss function.
The gradient is calculated as
\begin{equation*}
g_{t}=\nabla_{w} f\left(w_{t}\right) \tag{9.45}
\end{equation*}
Batch gradient descent uses all the samples, i.e., the whole training set as one batch, to calculate the gradient and then uses this gradient to update the model parameters. If the loss function is convex and the learning rate is small enough, we can expect to find the global optimal (minimal) value, i.e., the global minimal loss and the corresponding model parameters. However, the loss functions of deep learning problems are in general very complicated and usually not convex, so identifying the global minimum is usually infeasible. Notwithstanding, a local minimum can still yield a decent model. Batch gradient descent converges reliably to a local optimum and is relatively easy to implement. However, it places a high demand on memory because all the data must be processed together, and each gradient computation over the full training set takes a long time.
Stochastic gradient descent takes the opposite approach: it calculates the gradient and updates the model after each individual sample is processed. It thus uses the gradient determined by one sample, which can be viewed as the "slope" of a small area of the objective function, to decide the direction of the next optimization step. Because different samples may suggest very different or even opposite directions, it can take much more effort to find a good optimization direction. For the same reason, however, stochastic gradient descent is less likely to be trapped at a local optimum than batch gradient descent. Meanwhile, stochastic gradient descent cannot exploit vectorized array operations to improve computational efficiency.
Mini-batch gradient descent strikes a compromise between batch gradient descent and stochastic gradient descent: it uses a subset of the training data, a mini-batch, to calculate the gradient and update the model parameters. This method is adopted by many deep learning packages as the default SGD implementation or even the default solver.
The selection of the learning rate in SGD is very difficult; in many cases, it can only be done via experience and/or trial and error. Please note that the difficulty of training deep neural networks is not caused by local minima alone: the existence of saddle points can also trap the solver. At a saddle point, the gradient is zero, yet the point is a minimum along some directions and a maximum along others. Thus, the solver may conclude that an optimum has been reached because the gradient vanishes in all directions.
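To make the three variants concrete, the following sketch (illustrative NumPy code, not taken from any particular library) implements the update of Eqs. (9.43)-(9.45) for a linear least-squares loss. Setting batch_size to the dataset size gives batch gradient descent, setting it to 1 gives stochastic gradient descent, and intermediate values give mini-batch gradient descent.

```python
import numpy as np

def sgd(X, y, lr=0.01, batch_size=32, epochs=100, seed=0):
    """Gradient descent on 0.5 * ||X w - y||^2 with a configurable batch size."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = rng.permutation(len(X))              # reshuffle each epoch
        for start in range(0, len(X), batch_size):
            b = idx[start:start + batch_size]
            # Eq. (9.45): g_t = grad_w f(w_t); here f is the mean squared error
            g = X[b].T @ (X[b] @ w - y[b]) / len(b)
            # Eqs. (9.43)-(9.44): w_{t+1} = w_t - eta * g_t
            w -= lr * g
    return w

# Toy usage: recover known weights from synthetic data
X = np.random.default_rng(1).normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5])
print(sgd(X, y, lr=0.1, batch_size=20))
```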
Figure 9.11: Gradient descent method with manual learning rate

Momentum

As shown in Fig. 9.11, SGD relies only on the gradient of the current batch. Therefore, in the early stage of training the update can be unstable, while in the late stage the solver may lack the power to escape local optima or saddle points. The Momentum method incorporates the gradient of the previous update step in addition to the gradient of the current step. In this way, the optimization direction of the previous update enters the current step as a "momentum".
The update is calculated as
\begin{equation*}
\Delta w_{t}=\rho \cdot \Delta w_{t-1}-\eta \cdot g_{t} \tag{9.46}
\end{equation*}
Then, the update on the model parameter is performed as
\begin{equation*}
w_{t+1}=w_{t}+\Delta w_{t}=w_{t}+\rho \cdot \Delta w_{t-1}-\eta \cdot g_{t} \tag{9.47}
\end{equation*}
The gradient can be large in the early stage of training, so $\rho$ typically takes a relatively small value such as 0.5. As the gradient decreases in the later training stage, $\rho$ can be increased to values such as 0.9.
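As a minimal sketch of Eqs. (9.46)-(9.47), one Momentum step carries the previous update as state (the function name and default values are illustrative):

```python
def momentum_step(w, delta_prev, grad, lr=0.01, rho=0.9):
    """One Momentum update; delta_prev is the previous update Delta w_{t-1}."""
    delta = rho * delta_prev - lr * grad  # Eq. (9.46)
    return w + delta, delta               # Eq. (9.47): new w and new state
```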

Nesterov

Nesterov, or Nesterov Momentum (NAG), can be viewed as a modification of the Momentum method. Compared with Momentum, Nesterov uses the gradient at a "future" point for the update. For this purpose, we first move from the current point by the momentum to reach this look-ahead point at $w_{t}+\rho \cdot \Delta w_{t-1}$ (denoted as step $t+\frac{1}{2}$). Next, we calculate the gradient at this intermediate point, $g_{t+\frac{1}{2}}$. The momentum and this gradient are added up as $\Delta w_{t}$ for the update. Mathematically, this can be written as
\begin{equation*}
\Delta w_{t}=\rho \cdot \Delta w_{t-1}-\eta \cdot g_{t+\frac{1}{2}}=\rho \cdot \Delta w_{t-1}-\eta \cdot \nabla_{w} f\left(w_{t}+\rho \cdot \Delta w_{t-1}\right) \tag{9.48}
\end{equation*}
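A corresponding sketch of Eq. (9.48), assuming a user-supplied grad_fn that evaluates the gradient at an arbitrary point (all names are illustrative):

```python
def nesterov_step(w, delta_prev, grad_fn, lr=0.01, rho=0.9):
    """One Nesterov update: evaluate the gradient at the look-ahead point."""
    lookahead = w + rho * delta_prev                    # step t + 1/2
    delta = rho * delta_prev - lr * grad_fn(lookahead)  # Eq. (9.48)
    return w + delta, delta
```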
The learning rate in the above gradient descent methods is critical to their performance. However, all of these methods rely on a learning rate that is adjusted manually, and this adjustment can significantly determine the learning outcome.

AdaGrad

AdaGrad is proposed to automatically adjust the learning rate. The update on the model parameters is performed as follows:
\begin{equation*}
\Delta w_{t}=-\frac{\eta}{\sqrt{\sum_{\tau=1}^{t} g_{\tau}^{2}+\epsilon}} \cdot g_{t} \tag{9.49}
\end{equation*}
where η η eta\etaη is the initial learning rate, and ϵ ϵ epsilon\epsilonϵ is a small number used to avoid a zero denominator.
As can be seen in the above equation, the effective learning rate gradually decreases as the accumulated squared gradients $\sum_{\tau=1}^{t} g_{\tau}^{2}$ grow over the course of training.
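A minimal sketch of the AdaGrad step in Eq. (9.49), carrying the running sum of squared gradients as state (names and defaults are illustrative):

```python
import numpy as np

def adagrad_step(w, grad, sq_sum, lr=0.01, eps=1e-8):
    """One AdaGrad update; sq_sum accumulates g_tau^2 over all steps."""
    sq_sum = sq_sum + grad ** 2
    w = w - lr / np.sqrt(sq_sum + eps) * grad  # Eq. (9.49)
    return w, sq_sum
```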

AdaDelta

AdaDelta is proposed to address three issues with AdaGrad: 1) the learning rate can only decrease monotonically, which can lead to excessively small learning rates in late training stages; 2) the units (dimensions) on the two sides of the AdaGrad update equation are not consistent; and 3) the initial learning rate still needs to be set manually.
As for 1), AdaDelta uses a decaying expectation of $g_{t}^{2}$ instead of the cumulative sum:
\begin{equation*}
\Delta w_{t}=-\frac{\eta}{\sqrt{\mathbb{E}\left[g^{2}\right]_{t}+\epsilon}} \cdot g_{t} \tag{9.50}
\end{equation*}
The expectation can be calculated as an exponentially decaying average of recent values:
\begin{equation*}
\mathbb{E}\left[g^{2}\right]_{t}=\rho \cdot \mathbb{E}\left[g^{2}\right]_{t-1}+(1-\rho) \cdot g_{t}^{2} \tag{9.51}
\end{equation*}
As for 2) and 3), we can resort to Newton's method and obtain the following update equation, which is dimensionally consistent and requires no manually set learning rate:
\begin{equation*}
\Delta w_{t}=-\frac{\sqrt{\mathbb{E}\left[\Delta w^{2}\right]_{t-1}+\epsilon}}{\sqrt{\mathbb{E}\left[g^{2}\right]_{t}+\epsilon}} \cdot g_{t} \tag{9.52}
\end{equation*}

where $\mathbb{E}\left[\Delta w^{2}\right]_{t-1}$ is a decaying average of the squared updates, accumulated in the same way as Eq. (9.51).
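The following sketch of one AdaDelta step follows Eqs. (9.51)-(9.52), keeping decaying averages of both the squared gradients and the squared updates (names and defaults are illustrative; rho = 0.95 is a common choice, not one mandated by the text):

```python
import numpy as np

def adadelta_step(w, grad, eg2, edw2, rho=0.95, eps=1e-6):
    """One AdaDelta update; eg2 = E[g^2]_t, edw2 = E[(Delta w)^2]_{t-1}."""
    eg2 = rho * eg2 + (1 - rho) * grad ** 2                   # Eq. (9.51)
    delta = -np.sqrt(edw2 + eps) / np.sqrt(eg2 + eps) * grad  # Eq. (9.52)
    edw2 = rho * edw2 + (1 - rho) * delta ** 2                # update state
    return w + delta, eg2, edw2
```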

RMSprop

RMSprop is a variation, or a special case, of AdaDelta that sits between AdaGrad and AdaDelta. First, the $\rho$ value is fixed, at 0.5 in this formulation:
\begin{equation*}
\mathbb{E}\left[g^{2}\right]_{t}=0.5 \cdot \mathbb{E}\left[g^{2}\right]_{t-1}+(1-0.5) \cdot g_{t}^{2} \tag{9.53}
\end{equation*}
Second, like AdaGrad, it still uses an initial (global) learning rate $\eta$, combined with the root mean square $RMS[g]_{t}=\sqrt{\mathbb{E}\left[g^{2}\right]_{t}+\epsilon}$. The update equation is then as follows.
\begin{equation*}
\Delta w_{t}=-\frac{\eta}{\sqrt{\mathbb{E}\left[g^{2}\right]_{t}+\epsilon}} \cdot g_{t} \tag{9.54}
\end{equation*}
The use of an initial (global) learning rate together with a decaying average renders RMSprop suitable for unstable (non-stationary) objectives. Thus, it can generate relatively good results for RNNs.
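A minimal sketch of Eqs. (9.53)-(9.54); rho is exposed as a parameter because, although the text fixes it at 0.5, many implementations default to 0.9 (names and defaults are illustrative):

```python
import numpy as np

def rmsprop_step(w, grad, eg2, lr=0.01, rho=0.5, eps=1e-8):
    """One RMSprop update; eg2 is the decaying average E[g^2]_t."""
    eg2 = rho * eg2 + (1 - rho) * grad ** 2  # Eq. (9.53)
    w = w - lr / np.sqrt(eg2 + eps) * grad   # Eq. (9.54)
    return w, eg2
```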

Adam

Adaptive Moment Estimation (Adam) is another method that computes adaptive learning rates. Adam is essentially RMSprop with momentum: it uses estimates of the first- and second-order moments of the gradient to adapt the learning rate for each parameter.
First, the first-order and second-order moment estimates of the gradient are computed as follows:
\begin{align*}
m_{t} &= \mu \cdot m_{t-1}+(1-\mu) \cdot g_{t} \tag{9.55}\\
n_{t} &= \nu \cdot n_{t-1}+(1-\nu) \cdot g_{t}^{2} \tag{9.56}
\end{align*}
The above moments are then bias-corrected as follows:
\begin{align*}
\hat{m}_{t} &= \frac{m_{t}}{1-\mu^{t}} \tag{9.57}\\
\hat{n}_{t} &= \frac{n_{t}}{1-\nu^{t}} \tag{9.58}
\end{align*}
Then, the update is calculated as
\begin{equation*}
\Delta w_{t}=-\frac{\hat{m}_{t}}{\sqrt{\hat{n}_{t}}+\epsilon} \cdot \eta \tag{9.59}
\end{equation*}
The bias corrections above keep the effective step size within a bounded range, leading to stable variations of the learning rate. Adam integrates AdaGrad's advantage in handling sparse gradients and RMSprop's advantage in dealing with non-stationary objectives. Besides, Adam has a relatively low memory demand and can set adaptive learning rates for different parameters. It is applicable to most non-convex optimization problems and is suitable for large datasets and high-dimensional data.
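Putting Eqs. (9.55)-(9.59) together, one Adam step can be sketched as follows (names and default values such as mu = 0.9 and nu = 0.999 follow common usage and are illustrative):

```python
import numpy as np

def adam_step(w, grad, m, n, t, lr=0.001, mu=0.9, nu=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count used for bias correction."""
    m = mu * m + (1 - mu) * grad                 # Eq. (9.55): first moment
    n = nu * n + (1 - nu) * grad ** 2            # Eq. (9.56): second moment
    m_hat = m / (1 - mu ** t)                    # Eq. (9.57): bias correction
    n_hat = n / (1 - nu ** t)                    # Eq. (9.58)
    w = w - lr * m_hat / (np.sqrt(n_hat) + eps)  # Eq. (9.59)
    return w, m, n
```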

Nadam

Nadam can be viewed as a variant of Adam that incorporates Nesterov momentum: the look-ahead idea of Nesterov is applied to Adam's first-moment estimate.
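As a rough sketch of this idea, the following step applies a Nesterov-style look-ahead to Adam's bias-corrected first moment, following the common simplified formulation of Dozat (2016); all names and defaults are illustrative, not taken from the text:

```python
import numpy as np

def nadam_step(w, grad, m, n, t, lr=0.002, mu=0.9, nu=0.999, eps=1e-8):
    """One Nadam update: Adam's moments plus a Nesterov look-ahead on m."""
    m = mu * m + (1 - mu) * grad
    n = nu * n + (1 - nu) * grad ** 2
    # Look-ahead: mix the bias-corrected momentum with the current gradient
    m_bar = mu * m / (1 - mu ** (t + 1)) + (1 - mu) * grad / (1 - mu ** t)
    n_hat = n / (1 - nu ** t)
    return w - lr * m_bar / (np.sqrt(n_hat) + eps), m, n
```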